UPDATED April 2019
I obtained the Chicago Police Department’s gang database from ProPublica. It’s a dataset of individuals the CPD has classified as gang members. However, it was found to be full of errors and inaccuracies, including outdated information like gang members who are still active at the ripe old age of 132.
I made this gang database into a choropleth map of Chicago, which displays shading based on the number of individuals classified as gang members in each of Chicago’s police beats (a police beat is basically a patrol area assigned to a group of officers).
This map provides an accessible way to display how many people were put into the database and the police beat under which they were classified.
To see the map and conclusions, skip to the bottom! What follows is my data analysis process.
Read in the first gang data file from March 2018 (CPD gang database 3-18.xlsx).
Read in the second gang data file from November 2017 (CPD gang database 11-17.xlsx).
Then we read in the shapefile of the CPD police beats, so we can map them.
library(readxl)  # read_excel
library(sf)      # st_read
library(dplyr)   # filter, inner_join, group_by, summarize
library(stringr) # str_pad
library(leaflet) # interactive maps
#read in both databases
df_gang_318 <- read_excel("CPD Gang Data/CPD gang database 3-18.xlsx", sheet=1)
df_gang_1117 <- read_excel("CPD Gang Data/CPD gang database 11-17.xlsx", sheet=1)
# Set up the Chicago Police Beats Shapefile, downloaded from the Chicago data portal
map_filepath <- "Boundaries_Police Beats/geo_export_CPD_BEATS.shp"
cpd_beats <- st_read(map_filepath)
## Reading layer `geo_export_CPD_BEATS' from data source `/Users/PrincessO/code/CPD Gang database/Boundaries_Police Beats/geo_export_CPD_BEATS.shp' using driver `ESRI Shapefile'
## Simple feature collection with 277 features and 4 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
After reading in the data, I’m cleaning it below by dropping any rows that don’t have a police beat listed for the officer who created the entry or the location of the arrest, since we can’t map police beats that aren’t in the dataset.
#Cleaning the data -- take out any rows with NA for a police beat since we can't map those
df_gangs_318_ONLYBEATS <- filter(df_gang_318, !is.na(O_BEAT))
df_gangs_1117_ONLYBEATS <- filter(df_gang_1117, !is.na(BEAT_FIRST_ARREST))
Interestingly, after filtering out the nulls, I found that there seem to be a lot of rows/entries that don’t have a police beat.
So in both datasets, it seems there are about 39,000 entries in the database that have no police beat. This may be something to look into later!
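A minimal sketch of how the missing-beat count could be checked. The data frame `toy_gang` and its values are made up for illustration; they are not the real data.

```r
library(dplyr)

# Toy stand-in for one of the gang data frames (hypothetical values).
toy_gang <- tibble(
  O_BEAT = c("111", NA, "0824", NA, "1533")
)

# Rows with no beat listed versus rows that survive the filter.
n_missing <- sum(is.na(toy_gang$O_BEAT))
n_kept    <- nrow(filter(toy_gang, !is.na(O_BEAT)))

n_missing  # 2
n_kept     # 3
```

On the real data, the same check would presumably be `sum(is.na(df_gang_318$O_BEAT))` and `sum(is.na(df_gang_1117$BEAT_FIRST_ARREST))`.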
To start, let’s take a look at the CPD Police Beats shapefile. This preliminary map includes a popup of the police beat, and will be the base for further exploration and mapping.
#this is just a quick map to map the police beats
cpd_beats %>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(popup=~beat_num)
Before we join the gang database to the location of the police beats, we need to standardize the police beat number. Both databases contain police beat numbers that are either 3 or 4 digits long. So this step just adds a “0” to the beginning of any 3-digit number, to keep everything uniform (Example: Police Beat “111” becomes “0111” after this step).
# it looks like both dfs need a padded 0 in front of some of the numbers to make them 4 digits instead of 3. So let's clean that up before we join it with cpd_beats (which is all 4 digits)
df_gangs_318_ONLYBEATS$O_BEAT <- str_pad(df_gangs_318_ONLYBEATS$O_BEAT, 4, pad = "0")
df_gangs_1117_ONLYBEATS$BEAT_FIRST_ARREST <- str_pad(df_gangs_1117_ONLYBEATS$BEAT_FIRST_ARREST, 4, pad = "0")
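To see what the padding does, here is a small sketch on a made-up vector of beat numbers (the values in `beats` are illustrative only). `str_pad` pads on the left by default, so 4-digit values pass through unchanged.

```r
library(stringr)

# Hypothetical beat values: one 3-digit, two already 4-digit.
beats <- c("111", "0824", "2533")

# Pad to width 4 with leading zeros.
padded <- str_pad(beats, width = 4, pad = "0")

padded
# "0111" "0824" "2533"
```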
Now that both datasets have a uniform police beat number, I can join them to the location of the police beats and map them out. I created a new dataframe from it with a summary of the number of arrests per police beat (num_arrests).
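The steps below include an inner join of the beats shapefile with each database, followed by a count of entries per police beat.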
#inner join the two data frames. Used an inner join to drop any rows where the two police beat numbers didn't match up
beats_plus_gangs_318<- inner_join(cpd_beats, df_gangs_318_ONLYBEATS, by=c('beat_num'='O_BEAT'))
beats_plus_gangs_1117<- inner_join(cpd_beats, df_gangs_1117_ONLYBEATS, by=c('beat_num'='BEAT_FIRST_ARREST'))
Looks like the inner join dropped a few thousand rows from each database because they didn’t match up. I’m making another note to investigate what got dropped, since I expected the cpd_beats file and the cleaning to make everything match up. Maybe a few of the beats were entered wrong (user error)? Still, an inner join dropping anywhere from 1k to 5k rows from an 85k-row dataset doesn’t seem too alarming.
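One possible way to investigate the dropped rows is an anti join, which returns exactly the rows an inner join would discard. This is a sketch on toy data; `toy_beats`, `toy_gang`, and all values in them are hypothetical.

```r
library(dplyr)

# Toy stand-ins; the beat numbers are made up for illustration.
toy_beats <- tibble(beat_num = c("0111", "0824", "1533"))
toy_gang  <- tibble(
  O_BEAT = c("0111", "0824", "0824", "9999", "0000"),
  id     = 1:5
)

# Rows of toy_gang whose beat has no match in toy_beats,
# i.e. the rows an inner join would drop.
unmatched <- anti_join(toy_gang, toy_beats, by = c("O_BEAT" = "beat_num"))

# Tally the unmatched beat values to spot possible entry errors.
unmatched_counts <- count(unmatched, O_BEAT, sort = TRUE)
```

With the real data, `cpd_beats` is an sf object, so it may be simplest to pass `sf::st_drop_geometry(cpd_beats)` as the second table. I have not run that here.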
#get the number of arrests per police beat and create a new data frame (CPD_gang_map2) with just that info
CPD_gang_map2 <- beats_plus_gangs_1117 %>%
  group_by(beat_num) %>%
  summarize(num_arrests=n())
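The counting step above can be illustrated on toy data. The sketch below uses a made-up `toy_joined` table (not the real join result) and adds a sort so the beats with the most entries come first.

```r
library(dplyr)

# Hypothetical joined rows: one row per database entry.
toy_joined <- tibble(
  beat_num = c("0824", "0824", "0111", "0824", "1533", "0111")
)

# Count rows per beat, then sort from most to fewest.
toy_counts <- toy_joined %>%
  group_by(beat_num) %>%
  summarize(num_arrests = n()) %>%
  arrange(desc(num_arrests))
```

On an sf object such as `beats_plus_gangs_1117`, `summarize` also combines each group's geometries, which is why the summarized result can be passed straight to the map.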
Darker blue areas indicate more gang arrests, as recorded in the 11-17 database. Select a police beat to view a popup of the beat number and the number of arrests.
#set the color palette as Blues.
col_pal <- colorNumeric("Blues", domain=CPD_gang_map2$num_arrests)
#Add the pop-up text with Police Beat and Number of Arrests in that area
popup_sb <- paste0("Police Beat: ", as.character(CPD_gang_map2$beat_num), "<br>Number of arrests: ", as.character(CPD_gang_map2$num_arrests))
# leaflet worked to build map1
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(-87.628598, 41.855372, zoom = 10) %>%
  addPolygons(data = CPD_gang_map2,
              fillColor = ~col_pal(CPD_gang_map2$num_arrests),
              fillOpacity = 0.7,
              weight = 0.2,
              smoothFactor = 0.2,
              popup = ~popup_sb) %>%
  addLegend(pal = col_pal,
            values = CPD_gang_map2$num_arrests,
            position = "bottomright",
            title = "Number of Arrests")
Note: I’ve used “Number of Arrests” in the map above because it is built only from the 11-17 database. The 11-17 database specifically refers to “arrests,” while the 3-18 database only has the date the person was entered into the database and does not mention arrest as the basis for classification. So it’s unclear whether those people were entered because of an arrest or were classified as gang members by other means.
I determined that the 11-17 database coded gang classification by the date of arrest, while the 3-18 database only coded gang classification by the date the record was created.
Police Beat 0824 had 806 arrests
I chose a map view as an easy way to look at the data and see which Chicago areas have the most gang arrests entered into the database. Each police beat is mapped and can be selected to see the number of arrests it contains.